NSF PAR Search | NSF Public Access Repository

Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

https://doi.org/10.18653/v1/2024.emnlp-main.543

Feucht, Sheridan; Atkinson, David; Wallace, Byron C; Bau, David (January 2024, Association for Computational Linguistics)

LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b’s tokenizer splits the word “patrolling” into two tokens, “pat” and “rolling”, neither of which corresponds to semantically meaningful units like “patrol” or “-ing.” Similarly, the overall meanings of named entities like “Neil Young” and multi-word expressions like “break a leg” cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced “erasure” effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to “read out” the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
Full Text Available
AGGA: A Dataset of Academic Guidelines for Generative AIs

https://doi.org/10.7910/DVN/XZZHA5

Jiao, Junfeng; Afroogh, Saleh; Chen, Kevin; Atkinson, David; Dhurandhar, Amit (January 2024, Harvard Dataverse)

AGGA (Academic Guidelines for Generative AIs) is a dataset of 80 academic guidelines for the usage of generative AIs and large language models in academia, selected systematically and collected from official university websites across six continents. Comprising 181,225 words, the dataset supports natural language processing tasks such as language modeling, sentiment and semantic analysis, model synthesis, classification, and topic labeling. It can also serve as a benchmark for ambiguity detection and requirements categorization. This resource aims to facilitate research on AI governance in educational contexts, promoting a deeper understanding of the integration of AI technologies in academia.
Integrative Approaches to Understanding Organismal Responses to Aquatic Deoxygenation

https://doi.org/10.1086/722899

Woods, H. Arthur; Moran, Amy L.; Atkinson, David; Audzijonyte, Asta; Berenbrink, Michael; Borges, Francisco O.; Burnett, Karen G.; Burnett, Louis E.; Coates, Christopher J.; Collin, Rachel; et al (October 2022, The Biological Bulletin)

Oxygen bioavailability is declining in aquatic systems worldwide as a result of climate change and other anthropogenic stressors. For aquatic organisms, the consequences are poorly known but are likely to reflect both direct effects of declining oxygen bioavailability and interactions between oxygen and other stressors, including two (warming and acidification) that have received substantial attention in recent decades and that typically accompany oxygen changes. Drawing on the collected papers in this symposium volume (“An Oxygen Perspective on Climate Change”), we outline the causes and consequences of declining oxygen bioavailability. First, we discuss the scope of natural and predicted anthropogenic changes in aquatic oxygen levels. Although modern organisms are the result of long evolutionary histories during which they were exposed to natural oxygen regimes, anthropogenic change is now exposing them to more extreme conditions and novel combinations of low oxygen with other stressors. Second, we identify behavioral and physiological mechanisms that underlie the interactive effects of oxygen with other stressors, and we assess the range of potential organismal responses to oxygen limitation that occur across levels of biological organization and over multiple timescales. We argue that metabolism and energetics provide a powerful and unifying framework for understanding organism-oxygen interactions. Third, we conclude by outlining a set of approaches for maximizing the effectiveness of future work, including focusing on long-term experiments using biologically realistic variation in experimental factors and taking truly cross-disciplinary and integrative approaches to understanding and predicting future effects.
Full Text Available